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Abstract. In linear inverse problems, we have data derived from a noisy linear 
transformation of some unknown parameters, and we wish to estimate these 
unknowns from the data. Separable inverse problems are a powerful generalization 
in which the transformation itself depends on additional unknown parameters and 
we wish to determine both sets of parameters simultaneously. When separable 
problems are solved by optimization, convergence can often be accelerated by 
elimination of the linear variables, a strategy which appears most prominently 
in the variable projection methods due to Golub and Pereyra. Existing variable 
elimination methods require an explicit formula for the optimal value of the linear 
variables, so they cannot be used in problems with Poisson likelihoods, bound 
constraints, or other important departures from least squares. 

To address this limitation, we propose a generalization of variable elimination 
in which standard optimization methods are modified to behave as though 
a variable has been eliminated. We verify that this approach is a proper 
generalization by using it to re-derive several existing variable elimination 
techniques. We then extend the approach to bound-constrained and Poissonian 
problems, showing in the process that many of the best features of variable 
elimination methods can be duplicated in our framework. Tests on difficult 
exponential sum fitting and blind deconvolution problems indicate that the 
proposed approach can have significant speed and robustness advantages over 
standard methods. 



1. Introduction 

In linear inverse problems we are given a vector of noisy data b G E™ generated by the 
linear model b — Az ~\- e, where A S M'"^'= is a known matrix, e is a zero mean noise 
vector, and z e M^- is an unknown vector with = c entries we wish to estimate. 
In separable inverse problems, A is not known exactly, but depends on another set of 
parameters y G : 

b^Aiy)z + e. (1) 

The problem is now to determine the full set of iV = Ny + parameters x = {y,z). 

Many scientific inverse problems are separable. In time-resolved spectroscopy and 
physical chemistry, data are often modeled as a weighted sum of several (possibly 
complex) exponentials with unknown decay rates [HE]- Determining the weights 
and decay rates simultaneously is a separable inverse problem. Other examples 
include image deblurring with an incompletely known blur kernel [3] and tomographic 
reconstruction from incomplete geometric information 4 . Many more examples can 
be found in [Ml]- 

Separable problems frequently have additional exploitable structure. In this 
paper, we will be particularly interested in problems with multiple measurement 
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vectors generated by applying a single linear transformation to n different vectors 
of linear coefficients. In this case, the data and coefficient vectors can be represented 
by matrices 5 e M™''" and Z e R'^''", and we have 



This problem, also known as a multiple right-hand sides or multi-way data problem 
[SHlOj. occurs when a system is repeatedly observed under varying experimental 
conditions [2]. 

An inverse problem is generally solved by seeking parameter values that balance 
goodness of fit with conformity to prior expectations. In this paper we focus on 
constrained maximum likelihood problems, where we choose a goodness of fit function 
L{A{y)z) measuring discrepancy between A{y)z and b and a set X = y x Z 
representing known constraints on y and z, such as nonnegativity. We seek the 
parameter values that minimize the discrepancy subject to the constraints by solving 



Penalty functions such as £p norms on y and z may also be incorporated into F(y, z), 
and while our techniques are relevant to this case, it is not specifically addressed here. 
For the goodness of fit function we use the negative log-likelihood = — \ogp{b \ p.), 
where the likelihood function | fi) is the probability that b = ii + e and is determined 
by the distribution of e. Least squares problems result from assuming standard 
Gaussian distributed noise, so that L{p) = — but Poissonian and other 
likelihoods frequently arise. 

Unconstrained least squares problems are generally easiest to solve, and many 
powerful optimization ideas were first developed for this case However, 
unconstrained least squares solutions are not always satisfactory, and much better 
solutions can often be found using nonnegativity constraints, Poisson likelihoods, or 
other departures from ordinary least squares. Many physical quantities must be 
nonnegative, and enforcing this constraint can reduce reconstruction error [12^ and 
help the optimizer avoid unphysical answers |13j . A Poisson process is often the 
best model for a stream of particles entering a detector, and in the low-count limit 
the Poisson and Gaussian distributions are very different. In this case Poissonian 
optimization usually gives significantly better parameter estimates than least squares, 
a fact of fundamental importance in astronomy [T^lIIllIIl] , analytical chemistry [TB] , 
and biochemistry |17I18) , where information must be extracted efficiently from a trickle 
of incoming photons. This paper is concerned with advancing the state of the art for 
problems beyond least squares. 

1.1. Existing optimization methods 

We will focus on optimization methods employing Newton-type iterations. While 
other powerful methods exist for inverse problems, Newton-type methods enjoy very 
general applicability, attractive convergence properties, scalability under favorable 
conditions, and robustness against ill-conditioning and nonconvexity [11]. Given a 
smooth function /(u), a constraint set U C K^", and an initial point ^ U, 
a Newton- type method generates a sequence of iterates u^,v?',... which hopefully 
converge to the minimizer of f{u) in U, or at least a stationary point. Line search 



B = A{y)Z + E. 



(2) 




(3) 
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methods, which will be the focus of this paper, generally use the following update 
procedure to go from u'' to u*"'"*"^ [TT1I19) : 

(i) Search direction: A search direction Au is calculated by solving a Newton-type 
system of the form BAu — ~g, where g is determined from the gradient V/(u'^) 
and i? is a Hessian model approximating V^/(u'"'). Both g and B may be modified 
by information from constraints and previous iterates. 

(ii) Trial point calculation: The step Au determines a search path Up(s), parametrized 
by a step size s > 0, from which a trial point u is selected. This is generally a 
straight-line path modified to maintain feasibility with respect to constraints or 
hedge against a bad search direction. 

(iii) Evaluation and decision: If moving to the trial point produces a sufficient decrease 
in the objective, we set = u. Otherwise, another trial point is constructed, 
possibly along a new direction Au, and the process is repeated. 

This update procedure is used in the service of some larger strategy for optimizing 
F{y,z). To understand the strategies typically used, it is helpful to first consider 
strategies for solving the block-structured system BAx — —g. This system has the 
block expansion 



^yy 






'Ay 




9v 




Bzz 




Az 




9z 



(4) 



and is typically solved in one of three ways. (In the following, the product M^^w 
should be interpreted as a directive to solve Mv — w for v rather than to compute 
explicitly, and when we speak of inversion we refer to this directive.) 

(i) Full matrix, all-at-once. We solve the whole system at once by QR or Cholesky 
factorization in medium-scale problems, and by conjugate gradients (CG) in very 
large-scale problems. 

(ii) Block Gauss-Seidel. We converge to a solution by iterative updates of the form 

Ay=+' =-B;y\gy-By,Az^) (5) 
Az^+'=-B-}{g^-B,yAy^). (6) 

Gauss-Seidel is fast provided that Byy and B^^ are much easier to invert than all 
of B and a block diagonal approximation of B is reasonably accurate, but may 
be arbitrarily slow to converge otherwise [201 . 

(iii) Block Gaussian elimination. By solving for Az in the bottom row of (|4]) and 
substituting the result into the top row equation, we decompose ^ as 

BsAy = -gy + By,B-}g, (7a) 
B,,Az = -g, - B.yAy, (7b) 

where Bs = Byy — ByzB^^B^y is the Schur complement of B^z in B 21 . We 
construct the matrix Bs explicitly, solve for Ay in ([Taj, then plug the result into 
(j7bl) to solve for Az. 

Assuming B is positive definite, all three of these linear solvers can be interpreted as 
a method for minimizing the quadratic form ^Ax^'^BAx -\- g"'" Ax. Each of them can 
also be generalized to an update strategy for the nonquadratic problem ([3]), as follows: 
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(i) Full update: We update y and z simultaneously using a step derived from solving 
the full system (U). Any classical Newton-type method applied directly to F{y, z) 
falls into this category 

(ii) Alternating update: We make one or more updates to z with y fixed, then to y 
with z fixed, alternating until convergence [22 . Alternating methods now have 
well-developed convergence theory even with inexact alternating minimizations, 
and their iterations do not necessarily require matrix factorizations t23..24j . As 
such, they may be the only tractable choice for certain large-scale and highly 
non-parametric problems such as nonnegative matrix factorization. However, like 
Gauss-Seidel, alternating methods can converge slowly ^25jt26, and are generally 
preferable only when full updates are computationally expensive or intractable. 
In this paper we will focus on problems where full update methods are tractable, 
so alternation will not be considered further. 

(iii) Reduced update: We determine the optimal z value given y, 



to which the Newton- type iteration is applied, meaning that we set f{u) = Fr{y) 
instead of F(jj,z). The resulting update has a nested structure: an outer 
optimizer computes the search direction Ay and trial point y, while an inner 
optimizer calculates z by solving ([5]) whenever the outer one asks for the value of 
Fr{y) or its derivatives. 

Most reduced update methods are variations on the variable projection algorithm 
of Golub and Pereyra [51177] , which applies to the case of unconstrained separable least 
squares. In this case we have F{y,z) — ^ \\A{y)z — b\\^ and z„i{y) = A{y)^b, where 
X+ denotes the Moore-Penrose pseudoinverse. Substituting Zm{y) into F{y,z) yields 
Fr{y) = I ||— P4 where = I — XX~^ denotes the projection onto range(X)^, 
and the y in A{y) has been suppressed. Golub and Pereyra proposed using a Gauss- 
Newton method to optimize Fr{y); this requires the Jacobian for the reduced residual 
— P4 6, which they derived by differentiation of pseudoinverses. This idea can also be 
extended to accommodate linear constraints on z. 

The efficiency of variable projection in highly ill-conditioned curve fitting and 
statistical inference problems is theoretically and empirically well-attested [51I26I28II29) . 
Variable projection is also useful for problems with multiple measurement vectors 
[SHIO]. as in this case A{y) is block diagonal, so necessary pseudoinverses and 
derivatives may be efficiently computed blockwise. Other methods based on variable 
elimination can speed up the solution of large-scale image and volume reconstruction 
problems if the pseudoinverse and derivatives can be computed quickly [4[ |30H32] . 

Given the efficiency of variable elimination methods in separable least squares 
problems, one might hope to derive an extension with similar advantages to problems 
beyond least squares. However, such an extension runs into several difficulties. First, 
in problems beyond least squares there is generally no analytical formula for Zmiy), 
and computing it is often computationally expensive. Second, if inequality constraints 



Zm{y) = argmin L{A{y)z) 
zez 



(8) 



and substitute it into ([3]), giving an equivalent reduced problem 




(9) 
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or nonsmooth penalties are imposed on z, then Zjn{y) will be a nonsniooth function 
with unpredictable properties, so that the reduced problem may be even more difficult 
than the original. Third, without a formula for Zm{y) it is unclear how to compute 
Dzmiv), which is needed for a fast-converging second-order method. 

1.2. Our contribution 

Variable elimination does not seem to generalize easily to non-quadratic and 
constrained problems, but there are many efficient and robust full update methods 
for such problems This fact suggests that we might arrive at a generalization 

more easily from the other direction, by making existing full update methods resemble 
reduced update methods more closely. In this paper we explore the resulting semi- 
reduced update methods, explain how they relate to full and reduced update methods, 
describe when they are useful, and validate our claims with numerical experiments on 
hard inverse problems similar to those encountered in practice. 

In ^ we show how to transform a full update method into a reduced method 
without an explicit formula for z„i{y). We begin by applying two specific changes to 
a full update method: first, use block Gaussian elimination instead of an all-at-once 
solver, and second, adjust every new trial point's z coordinate to a better value before 
the trial point is evaluated. This second technique, which we call block trial point 
adjustment, is depicted graphically in Fig. [1] right. We call a full update method thus 
modified a semi-reduced method. Reduced methods are obtained from semi-reduced 
methods by requiring that the adjustment be optimal, which enables us to simplify 
the method by omitting computations of V^-F and the search direction Az. We show 
reduced Newton and variable projection methods can be derived in this way. In fj3l we 
propose and prove convergence of a semi-reduced method that allows for nonquadratic 
likelihoods and bound constraints on z, which has been posed as an open problem by 
multiple authors [2111]. 

The description of reduced and semi-reduced methods as modifications of full 
update methods allows us to predict when the former have advantages over the latter. 
Block Gaussian elimination is most effective when Bzz is easier to invert than all 
of B, for example when B^^ has block diagonal (Fig. [IJ, Toeplitz, banded, or other 
efficiently invertible structure. Block trial point adjustment should yield an efficiency 
gain when the computational burden of the adjustment subproblems is outweighed 
by an increase in convergence rate. This may occur when the graph of the objective 
contains a narrow, curved valley like that shown in Fig. [1] 

To test these predictions we select problems where we expect semi-reduced 
methods to have an advantage, design methods for these problems using the semi- 
reduced framework, then compare the semi-reduced methods to standard full update 
methods. In ^we derive linear algebra techniques that use block Gaussian elimination 
to exploit block structure or spectral properties of -B, and in i jS.ll and ^5.2\ we study two 
problems of scientific interest where these techniques have advantages over standard 
full-matrix methods. In t j5.3l we consider a toy blind deconvolution problem where 
block trial point adjustment leads to a significant increase in convergence rate due to 
a curved valley geometry. We conclude that semi-reduced methods can have significant 
advantages over full update methods under the predicted conditions. 
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1.3. Related work 

While the relationship between full and reduced update methods has been explored 
several times, the relationship established here is a major extension of previous work. 
In [26] Ruhe and Wedin developed the connection between full and reduced update 
Newton and Gauss-Newton methods, and semi- reduced methods are described by 
Smyth as partial Gauss-Seidel or nested methods in [28j . Our work extends theirs 
in that we consider general Newton-type methods, nonquadratic likelihoods, and 
the effect of globalization strategies, such as line search or trust regions, which 
ensure convergence to a stationary point from arbitrary initialization. A very general 
theoretical analysis of the relationship between the full and reduced problems is given 
in [33], but there is little discussion of practical algorithms and no mention of semi- 
reduced methods. 

Structured linear algebra techniques such as block Gaussian elimination are 
known to be useful [311 [33], but they are underutihzed in practice. This is 
apparent from the fact that most optimization codes employ a limited set of broadly 
applicable linear algebra techniques and very few are designed to accommodate 
user-defined linear solvers such as the ones we propose in ^ We contend that 
since significant speed gains are attainable with special linear solvers, optimization 
algorithm implementations should accommodate user-customized linear algebra by 
adding appropriate callback and reverse communication protocols. 

Trial point adjustment is a key idea in the two-step line search and trust region 
algorithms of p6] and [37]. General convergence results are proven in f38| for 
'accelerated' line search and trust region methods employing trial point adjustment. 
These works are not concerned with separable inverse problems or the relationship 
with reduced methods. 

Extensions of variable projection beyond unconstrained least squares have been 
proposed, in particular to accommodate bound constraints on z [13l|3^. Their 
approach is to apply a Newton-type method to minimize Fr{y) = F{y,Zmiy)), an 
approximation of Fr{y) = F{y,Zmiy)) obtained by computing Zm{y) approximately 
using a projected gradient or active set method. This approach can work well, but 
it has several theoretical and practical downsides. First, it has not been extended 
to nonquadratic likelihoods; second, computing Zm{y) can be very expensive, and 
the precision required is unclear; third, an appropriate Hessian model is not obvious 
and must be obtained by ad-hoc heuristics or finite differences; and fourth, there has 
been no attempt at global convergence results. In contrast, our approach works on 
nonquadratic likelihoods; it provides the option of approximating Zm{y) to any desired 
precision without danger of sacrificing convergence; one may use the same standard 
Hessian models used in full update methods, with exact derivatives if desired; and we 
prove a global convergence result for our method. 

2. Semi-reduced methods as a generalization of variable elimination 

In this section we show that a full update algorithm may be transformed into a reduced 
update (variable elimination) algorithm by introducing block Gaussian elimination 
and an optimal block trial point adjustment, then simplifying the resulting algorithm 
to remove unnecessary computation. Semi-reduced methods are those obtained 
halfway through this process, after the block techniques are imposed but before the 
simplification. We will describe the transformation process for unconstrained Newton- 
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z 

Figure 1. Situations where block gaussian elimination and trial point adjustment 
may be useful. Left: A 'block arrow' matrix B containing a block diagonal 
submatrix B^z is well-suited for inversion by block Gaussian elimination. This 
type of matrix arises in separable problems with multiple measurement vectors. 
Right: Graph of an objective F(y, z) exhibiting a narrow, curved valley; the 
minimum is marked with an X. Superimposed are a sample iterate (y^,z^) and 
an initial trial point (y^,z^) that fails a sufficient decrease test. By adjusting 
this point's z coordinate to the minimum of F{y^ , z), we obtain a new trial point 
2*+^) that provides sufficient decrease to be accepted as an update. 

type line search algorithms, but it can be done with other types of algorithms too. 

2.1. Semi-reduced methods and their simplification under optimal adjustment 

We begin the move towards semi-reduced methods by defining a standard 
unconstrained line search algorithm, then adding trial point adjustment. Let f{u) 
be a twice-differentiable function and 3§{u) G Jj^^x^^ the Hessian model, a positive 
definite matrix- valued function approximating V^/(u). 

Given an iterate u'', we obtain the update by the following procedure. We 
begin by setting g — V f{u^), B — 3§{u''), and determining the search direction Au 
by solving BAu = —g. The search direction determines a line Up{s) = u'' + sAu of 
potential trial points parametrized by step size s, and we set u'^^^ by choosing one 
that satisfies the sufficient decrease condition 

f{up{s)) - fiu") < 6g^{up{s) - u'^) = Sg^isAu), (10) 

for a fixed 6 G (0, 1/2). One can generally ensure convergence by picking a step size 
that obeys this condition and is not too small. Such a step size can be obtained by 
backtracking: we set s = and try j = 0, 1, 2, . . . until pO)) is satisfied. 

To incorporate trial point adjustment into this update procedure, we assume we 
are given an adjustment operator Ud{u) such that f{ud{u)) < f(u) for any input u. We 
then replace Up{s) with Ud{up{s)) on the left hand side of pUj) . obtaining Alg.[T] (Note 
that the standard full update method may be recovered by setting Ud(u) = u.) Global 
convergence of Alg.[T]to a stationary point is guaranteed by the following theorem: 

Theorem 1. Assume that f{u) is bounded below, V/(u) is Lipschitz continuous with 
bounded Lipschitz constant, and the matrices ^{u) are symmetric positive definite 
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with eigenvalues bounded away from zero and infinity. Then 

lim Vfiu'') = 0, (11) 

and any hmit point of (u'^)^q is a stationary point. 

This theorem is proven in |19j for the standard algorithm without trial point 
adjustment, while the extension for algorithms including trial point adjustment is 
given in [35]. Informally, trial point adjustment does not harm convergence because 
convergence requires only that /(m'^) decreases by some minimal amount for each 
iteration k, and the adjustment operator can only make the decrease larger. 



Algorithm 1 Backtracking line search method with trial point adjustment 

Input: u° e M^", 6 G (0, 1/2), a e (0, 1) 
1: for fc = 0,1,2,... do 

2: g = Wfiu''), B = .^{u^) 

3: Solve for Au: SAu = —g 

4: Up{s) — u'^ + sAu 

5: Find the smallest j > such that f{udiup{a^))) — f{u^) < Sg'^{up{a^) — u'^) 

6: u^+'^ = Ud{Up{a^)) 

7: end for 



To make Alg. [T] into a semi-reduced method for minimizing a function F{x) — 
F{y,z), we set f{u) — F{x) and put system BAx = —g into the block Gaussian 
decomposed form We then require the trial point adjustment to have the form 
Xd{y,z) — (y, Zd{y, z)), so that only z can change. The result of these changes is 
Alg.H 



Algorithm 2 Semi-reduced line search method. 

Input: x° = iy°,z°) e M^, (5 e (0, 1/2), a G (0, 1) 
1: Define Xd{y,z) = {y,Zd{y)) 
2: for fc = 0,1,2,... do 

3: g = \IF{x^),B = .S§(x^) 

4: Solve for Ay: B^Ay = -gy + By,B-}g^ 

5: Solve for Az: B^zAz — ~gz — B^yAy 

6: Define Xp{s) = {yp(s) , Zp{s)) = {y^ -f sAy, z^ + sAz) 

7: Find the smallest j > such that F{xd{xp{a^))) - F{x^) < 5g'^{xp{a') - x^) 

8: x'^+l = Xd{Xp{a')) 

9: end for 



To make Alg. [2] into a reduced update method, we assume our trial point 
adjustment is unique and optimal, Zd{y,z) = Zm{y) = argmin ^ F(y, 2;), and 
exploit this fact to simplify the algorithm. Optimal adjustments ensure that gz = 
zF{y^ , Zra{y^)) = for all k, so terms involving gz disappear. In particular, line 7 
reduces to g'^{xp{a^) — x'^) — gy{yp{ct^) —y^)- After terms involving gz are removed, 
the trial point Zp{s) = z^ -\- sAz appears only within the expression Xd{xp{a^)). But 
if we write out Xd{xp{s)) — [y^ + sAy,Zm{y^ + sAyj), we see that 2*^ + sAz has 
been supplanted by the adjusted point Zm{y^ + sAy), so we may skip it by redefining 
(s) = [y'^ -\- sAy,Zm{y'' + sAy)). The disappearance of z'' + sAz renders 
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the step Az unused in any way, so line 5 can be deleted. What is left is Alg. [3J a 
simplified semi-reduced method. In the next section we show that, when B is chosen 
appropriately, versions of this simplified semi-reduced method are identical to several 
reduced (variable elimination) methods in the literature. 



Algorithm 3 Simplified semi-reduced line search method. 

Input: x" - iy",z^{y°)) eR^,6e (0, 1/2), a G (0, 1) 
1: for A: = 0,1,2,... do 

2: gy = VyF{x''),B = ^{x'') 
3: Solve BgAy = —gy 

4: Define Xp{s) = (yp(s), Zp(s)) = (y'^ + sAy, z^iy^ + sAy)) 

5: Find the smallest j > such that F{xp{a^)) - F{x^) < Sgy{yp{a^) - y^) 

6: x^^^ = Xp{a^) 

7: end for 



Reinterpreting variable elimination as a simplified semi-reduced method allows us 
to precisely articulate the cost-benefit tradeoff involved in using variable elimination, 
as well as the raison d'etre for non-simplified semi- reduced methods. The benefit of 
variable elimination is that we need not compute g^, Az, or quantities dependent on 
them, and the trial point adjustments may cause the algorithm to converge faster. 
The cost is that we must compute the optimal z value after every y update, while 
in semi-reduced updates we only require that the adjustment does not increase the 
objective. Variable elimination is preferable only if the adjustment subproblem can be 
solved quite quickly and yields a significantly increased convergence rate. While this 
condition often holds in unconstrained least squares problems, in general calculating 
argmin ^ F(y, z) is often quite costly and may not be worth the trouble. Semi- 
reduced methods permit us to forgo this cost, granting increased flexibility without 
compromising convergence. 

2.2. Variable elimination as a simplified semi-reduced method 

Here we show that three popular reduced (variable elimination) methods can all be 
interpreted as simplified semi-reduced methods with an appropriate Hessian model. In 
other words, reduced methods can be obtained by operations on -F(y, z) alone, without 
ever forming the objective Fr{y) explicitly. This surprising result is essentially due to 
the implicit function theorem and the fact that optimization methods only use very 
limited local information about a function to determine iterates. We begin with a new 
lemma stating the exact condition required for a reduced and a simplified semi-reduced 
method to be equivalent. 

Lemma 2. Let y^ G M^" be given, and let z° = Zm{y^). Let invertible Hessian models 
^r{y) and ^f{y, z) for Fr{y) and F{y, z) be given. Assume that Zm{y) is well-defined: 
that is, there is a unique solution of min^ F(y, z) for any given y. Consider the following 
pair of Newton-type algorithms: 

(i) Reduced method: Alg. [T]with f{u) — Fr{y), yd{y) — y, — ^r- 

(ii) Simplified semi-reduced method: Alg. |3] with ^ = ^f. 

Let Bg — Byy — ByzB~^Bzy. These two algorithms generate identical iterates if and 
only if, at all points y^ visited by each algorithm, the Hessian models B^ — ^r{y) 
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and B = ^f{y, Zm{y)) obey 

Br = Bs. (12) 

Proof. After the specified substitutions are made, Algs. [T] and [3] have exactly one 
difference: the gradient used in Alg.[T]is VFr(y), while in Alg.[3]it is WyF{y, z). Thus 
it suffices to show that V Fr{y) = \i'yF{y,z). Letting Dz^ denote the Jacobian of 
Zm{y), we have 

VFriy) = VyF{y, Zmiy)) + Dz,^ ■ V,F{y, z^{y)) = Vj,F(y, z^{y)), (13) 

where the second term has vanished because z^^iy) is a stationary point of F{y, z), so 
V,F(2/,z„(2/)) -0. □ 

Now we show that the reduced Newton method (i.e. Newton's method on Fr{y)) 
can be interpreted as a simphfied semi-reduced Newton method on F{y, z). This was 
imphcitly shown by Richards [40 for the classical, nonglobalized Newton iteration. 

Proposition 3. Under the assumptions of Lemma[2l the reduced method Alg.[T]with 

Hessian model Br — V^Fr is equivalent to the simplified semi-reduced method Alg.|3] 
with model B = V^F. 

Proof. We need only verify the Schur complement relation ([T^ . Differentiating 
we have 

V^Fr^VlyF + Vl,F-Dz,^. (14) 

Dzm can be obtained by implicit differentiation of the stationary point condition 
V,i^(2/,z™(2/)) =0: 

VlyF{y, zraiy)) + Vl,F{y, z^{y)) -Dz^^Q (15) 

Dzr,,^-[Vl,F]-^VlyF (16) 

Plugging this expression into ([T4l) and setting Br — V^Fr and B — V^-F yields (fT2l) 
as desired. □ 

Now consider the separable case, where F{y,z) = L{A{y)z), but L{fi) is not 
necessarily a least squares functional. We derive two simplified semi-reduced methods 
for this objective. In the least squares case, these methods are equivalent to the 
Kaufman [IT] and Golub-Pereyra [S1[3I] variants of variable projection, but they also 
apply to general nonquadratic L, a case for which no reduced method existed before. 
To derive our methods, we note that the variable projection model Hessians Br have 
a closed-form normal decomposition: they can be written as Br — Xj Xr for some 
explicit Xr. Accordingly we will seek Hessian models B such that Bg = XjXg for 
some closed-form Xs. 

We set some notation and conventions before we begin. Let X.j the column 
of a matrix X. For any full column rank matrix X, X^ = {X'^X)~'^X'^ is the 
Moore-Penrose psuedoinverse and Pji — I — XX'^ is the orthogonal projector onto 
range(Ar)-'-. Given a function f{u,v) let Df = [duf,dyf] denote its Jacobian. To 
simplify our formulas we define the quantities ^J.{y,z) — A{y)z, W = (V^L)^^, and 
A = WA. We abuse notation by ignoring the implicit dependence of W on y and z, 
which allows us to write Wdy- A as dy. A. 
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We begin by decomposing the full Hessian of F into two components: W^F = 
G+E. The G term is the Gauss-Newton Hessian model, G = J, where J = W{Dii). 
The blocks of J are given by 



iJy):,j = {dy^A)z for j = 1, . 



a' 



J.=A. 



(17) 



The E component is a residual term given by i? = ^j(VL)iV^/ii. Note that E^z ~ 
because V^^^i — for all i. 

The first Hessian model we consider will be G. A closed-form normal 
decomposition for Gs can be derived by: 



Gs = Jj(/- 



(18) 



where the last line uses the fact that orthogonal projection is symmetric and 
idempotent, and the minus sign has been introduced for consistency with the variable 
projection convention. By Lemma[2]this result yields a pair of equivalent reduced and 
simplified semi-reduced methods for any L{pL): 

Proposition 4. The reduced method Alg.[l]with Hessian model Br — Gg is equivalent 
to the simplified semi- reduced method Alg. |3] with model B = G. 

In the least squares case we have z = Zjn{y) — so Jg — —P^{dy.A)z = 

—P^{dyjA)A^b, and this Jg is precisely the reduced Jacobian Jr proposed by 
Kaufman. Thus we have Gg — Gr and the following result, which was proven by 
Ruhe and Wedin in [33] for algorithms without globalization: 

Corollary 5. Kaufman's variable projection method is equivalent to a simplified 
semi-reduced method for separable least squares using B = G. 

Next we express the Golub-Pereyra variable projection method as a simplified 
semi-reduced method. To do this we need a Hessian model H such that Hg = Kg, 
where Kg is equal to the Golub-Pereyra reduced Jacobian Kr- This is a challenging 
problem because the Golub-Pereyra model Hr — Kj Kr is a closer approximation to 
V^i^r than the Kaufman model Gr, but there is no obvious normally decomposable 
H that approximates V^F better than the traditional Gauss-Newton model G. 

Fortunately the model may be derived by an ingenious technique due to Ruhe and 
Wedin. Essentially, their idea is to apply a block Cholesky factorization to V^F and 
use the factors to help reduce the discrepancy between G and V^F. In our notation 
the Cholesky factorization used is the U DU"^ factorization, which is simply the more 
familiar LDL^ factorization [TT] with the conventional variable order reversed. Given 
a matrix X, we write its UDU'^ factorization as X = UXU^, where 



X = 



Xg 
X,,., 



u = 



(19) 



and Xg = Xyy — Xy^X^J-Xzy Note that the U factor is determined uniquely by its 
yz block. Setting X = V^F we have Uyz = {Gyz + Eyz)G~}. 

To derive H, consider the product U~^GU~'^ , which is positive definite and 
normally decomposable because G is. If G were the true Hessian U^^GU^'^ would be 
block diagonal, but in reality 



U-^GU-^ 



-E, 



Gz 



(20) 
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Letting H denote the diagonal of U^^GU^^ , we can define a new positive definite 
and normally decomposable Hessian model by setting H = UHU^ . From it 
immediately follows that 

Hs ~ Hyy = Gs + EyzG z} Ezy- (21) 

Now we express Hs in the form Hs = KJKs- We have already decomposed 
Gs = J^Js', a similar formula for the second term, EyzG^^E^y, is given by 

EyzGj^'Ezy = EyM^Ar^Ezy = [{A+ )^ E ^yf [{A+f E ^y] = M , (22) 

where we have used the identity [X"^ X)^^ = X^{X^)'^ valid for any matrix X 
with full column rank. We now have Hg = JjJs + M , where Jg — —P^Jy and 
M = {A~^)'^E2y. Surprisingly we may rewrite this as Hs = {Js + M)'^{Js + M) because 
the cross terms vanish: jjM = -J^ P^{A+)^ E^y = for P^{X+)'^ = 0. Therefore, 
by setting Kg — Js + M, we have Hs — K^Kg as desired. 

All that remains is to compute Kg, which we do column- by-column. The j*'* 
column of Js is {Js):,j — {—P^Jy):,j — —P^{dyjA)z, while the j*'* column of E^y is 
given elementwise by 

{Ezy)kj = J2'^\7L)AA^{Az), = ^(Vi).(9,,A).fe = [{dy^AfVL]k, (23) 

i i 

SO we have M-j — {A'^)'^{dy.A)'^WL. We write this in terms of A by defining the 
weighted residual r = W~^VL, so that M^j = {A'^)'^ {dy-A)'^r. Thus the desired 
formula for i^s's columns is 

(Ks)..^, = (J,),, -t- M,, = -P^{dy^A)z + {A+fidy^Afr. (24) 

Again invoking Lemma [21 we have shown that 

Proposition 6. The reduced method Alg.[T]with Hessian model Br — Hs is equivalent 
to the simplified semi- reduced method Alg. |3]with model B — H. 

Specializing this result to the least-squares case £(/i) = | — b\\'^ as before, we 
have r ~ Az — b ~ AA'^b — b = —P^b, and {Ks):j simplifies to 

(Kg),, = - {P^{dy^A)A+ + {P^{dy^A)A+f) b, (25) 

which is precisely the Jacobian of the reduced functional F{y, Zjn{y)) = \ ||— P4 5||^ 
derived by Golub and Pereyra [5]. Since Kr = Kg, we have Hr — Hg and the desired 
equivalence: 

Corollary 7. The Golub-Pereyra variable projection method is equivalent to a 
simplified semi-reduced method for separable least squares using Hessian model H . 

2.3. Semi-reduced methods as the natural generalization of variable elimination 

Proposition [3] and Corollaries [5] and [7] show that the reduced Newton's method and 
both variants of variable projection can be interpreted as simplified semi-reduced 
methods. In addition. Propositions |4] and [6] define new simplified semi-reduced 
methods that generalize variable projection to nonquadratic L{ii). 
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Unfortunately these algorithms are of more theoretical than practical use, for the 
following reasons. First, we still have not dealt with the problem of computing Zm{y)- 
In general there is no closed form for (y) , and computing it may be so expensive that 
the computational burden outweighs any increase in convergence rate over a simple 
full or alternating update method. Second, if the domain of £i{fii) is a bounded subset 
of M, as is true for the Poisson and several other log-likelihoods, the bounds often must 
be enforced via reparametrization or constrained optimization. This adds still more 
complexity and in the latter case makes unconstrained optimization inapplicable. 

The driving technical insight of this paper is the following: if we forgo the 
simplifications afforded by using optimal block trial point adjustment and use 
an ordinary semi-reduced method instead, all of these barriers and difficulties 
disappear. Trial point adjustments need not be optimal, so there is no need for the 
computationally expensive Zmiy)-, and constraints can be handled by incorporating 
trial point adjustment and block Gaussian elimination into classical full update 
methods. Thus, semi-reduced methods provide a natural way to extend variable 
elimination methods beyond least squares. 

3. A semi-reduced method for bound constrained and nonquadratic 
problems 

In this section we present a classical method for smooth bound-constrained problems 
and turn it into a semi-reduced method. The problem we wish to solve is 

minimize fix) subject to I < x < u, (26) 

where —oo<l<u<oo are vectors bounding the components of a; € M^, and 
f{x) is twice differentiable. The method we present is a trial point adjusted variant of 
Bertsekas's projected Newton method [1^43| . (In our terminology Bertsekas's method 
is better described as a projected Newton-type method, since it allows for approximate 
Hessian models.) We choose this method because it is relatively simple, its convergence 
is global and potentially superlinear, and similar second-order gradient projection 
methods are empirically among the statc-of-thc-art for a variety of constrained inverse 
problems ^12..14..44.,45 . 

The update in the projected Newton-type method is of the form 

x''+^ ^Vix'' - S'^Vfix'')), (27) 

where iS"*^ is a scaling matrix which we will assume to be positive definite, and V{w) 
is the projection of w onto the box B ~ {w \ l < w < u] given componentwise by 

'P{w)i ^ T!\edia,n{li,Wi,Ui). (28) 

This iteration is a generalization of the projected gradient method, which restricts 
to be a multiple of the identity. Bertsekas showed that the naive Newton-type choice 
S'^ = [B'')~^ with B'' V^/(x'^) can cause convergence failures, but convergence can 
be assured by modifying the naive choice using a very simple active set strategy, in 
which the Hessian is modified to be diagonal with respect to the active indices. 

To describe projected Newton-type methods we will use the following notation. 
Let [TV] = {l,...,iV}, and for any J C [N], let vj = {vi)i^j denote the subvector 



Generalizing Variable Elimination Beyond Least Squares 



14 



of w e indexed by J and Xj j = [Xij]i j^j the indexed submatrix of X e M"^". 
Given e > and x G M", we define the active set associated with x by 

A{x) ^{ie [N] I (V/(x), > 0, x,<k + e) or (V/(x), < 0, a;, > u, - e)}, (29) 

and the inactive set as its complement, T{x) ~ [iV] — A(x). Using this notation 
we present in Alg. |4]the projected Newton- type method with trial point adjustment, 
where an identity scahng matrix is chosen on the active set for concreteness. As in the 
unconstrained case, the only difference between the semi-reduced method Alg. S] and 
the original full update method (found in equations (32)-(37) of f42|) is the addition 
of the adjustment operator Xdix): specifically, on the left hand side of line 5 and the 
right hand side of line 6 of Alg. 21 our method has Xdix^^o?)) while [42] has only 
Xpipi?^. Careful examination of the global convergence proof. Proposition 2 in |42j . 
reveals that, with very minor additions, it also establishes convergence of Alg. 01 Here 
we review the argument very briefiy, with just enough detail to describe how to adapt 
it to accommodate trial point adjustment. 

Algorithm 4 Projected Newton-type method with trial point adjustment. 

Input: e E^, (5 e (0, 1/2), a e (0, 1) 
1: for A: = 0,1,2,... do 

2: g = V/(x'=), = SS(x^) 

3: Ax/ = -{Blj)-^gi, AxA = -gA 

4: Define Xp{s) ^V{x^ + sAx) 

5: Find the smallest j > such that 

/(xrf(xp(a^)) - /(x'^) < 5 {gj{a'/\xi) + gl{xp{a^)A - 4)} (30) 

6: x''+^ = Xd{Xp{a^)) 

7: e min(eo, ||a;p(l) — x'^ll) 
8: end for 



Proposition 8. Assume that V f{x) is Lipschitz continuous on any bounded set of 
and the eigenvalues of B*' are uniformly bounded away from zero and infinity for 
all k. Then every limit point of Alg. [His a stationary point of Problem p6|) . 

Proof. By contradiction: suppose a subsequence {x'')keK of (a;'^)^Q exists such that 
limfc_5.oo,fc6X x'' — X, where x is not a critical point. Let Sk — a^'' denote the step size 
chosen at iteration k on line 5; the proof of Proposition 2 of 0^ first shows that the 
monotonicity of the sequence (/(a;'^))^g, Lipschitz continuity of Vf{x), eigenvalue 
bound on B'^, and the nonpositivity of the terms on the right hand side of (PU]) 
together imply that liminik^ao, keK Sk — 0. Since all the required properties still hold 
in our case, this conclusion holds for Alg. [Has well. Next it is shown that, for some 
s > independent of fc, we have 

f{xp{s))- f{xk) <6{gJ{sAxi)+g^{xp{s)A-x\)} for s < s. (31) 

The computation supporting this claim depends only on the properties of 
f{x),Xp{s), A, and I, so it still holds for Alg. |H But f{xd{x)) < f{x) for all x, so 

f{xdixp{s))) - f{xk) < S{gJ{sAxi) + gl{xp{s)A - x'X)} for s < s. (32) 
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It follows that Sfc > a'^ where J is the smallest nonnegative integer such that a"' < s, 
contradicting liminffe_i.oo,feeA' Sfc =0. □ 

Careful examination of the proofs in indicates that the other properties of the 
projected Newton-type method generally continue to hold for the trial point adjusted 
version, but the full details are beyond the scope of this paper. 

Since B is only required to have eigenvalues bounded away from and oo, Alg. |4] 
can accommodate a wide variety of Hessian models and regularization strategies. 
In our numerical experiments we use Alg. [SJ which is a special case of Alg. 3] and 
thus inherits its convergence properties. Alg. [S] sets B = ^{x) + XI, where ^{x) 
is a Gauss-Newton Hessian, XI is a Levenberg-Marquardt damping term, and A is 
adjusted at every iteration according to a step quality metric p. Levenberg-Marquardt 
regularization is useful for guarding against rank-deficient Hessian models jllj . 



Algorithm 5 Damped projected Newton-type method with trial point adjustment. 

Input: S e (0, 1/2), a e (0, 1), X^in, Xmax € [O,0o), < Pbad < Pgood < 1 

Input: r G (0, oo), eo € (0,oo) 
1: e ^ eo 

2: for A: = 0,1,2,... do 

3: B^ ,^(x^) + XI, V/(a:'=), A = Aix''), I = I(x'=) 

4: If ||7^(a;'^ — g) — <Torfc> kmax, stop. 

5: Solve Bj jAxj ~ —gi- Set IS.xa = —gA- 

6: Define Xp{s) ='P{x^ + sAx) 

7: Find the smallest j > such that 

f{xd{xp{a')) - fi^") < S {gJic^'^xi) + g^{xp{a^)A - 4)} 

8: X''+^ Xd{Xp{a^)) 

9: p={f{x'')~fix''+'j)/{-yjAxj) 

10: if p > Pgood then 

11: A ^ max(A/2, A„ii„) 

12: else if p < Pbad then 

13: A min(10A, Xmax) 

14: end if 

15: e ^ min(eo, ||a;p(l) — x*"'!!) 
16: end for 



While any linear algebra technique may be used to solve the Newton-type systems 
in Algs. m and EJ block Gaussian elimination is of particular interest because of the 
role it plays in our semi-reduced framework. The block Gaussian elimination methods 
used in our numerical experiments are introduced in the next section. 

4. Using block Gaussian elimination to exploit separable structure 

One of the key advantages of variable elimination methods is their ability to 
take advantage of special problem structure, such as multiple measurement vectors 
[S]. Block Gaussian elimination can be used to derive linear solvers with similar 
structure exploiting properties. Here we describe two such algorithms which we 
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claim can provide an advantage over standard methods; these claims are tested in 
our experiments below. 

The first method is a QR method for normal equations, and is thus appropriate 
for methods employing a Gauss-Newton Hessian model. This approach is most suited 
for highly ill-conditioned systems, such as those arising from exponential fitting and 
other difficult problems traditionally tackled by variable projection. Generalized 
Gauss-Newton Hessian models for nonquadratic likelihoods can be handled by this 
method [46]. The second method is for problems where z is very high dimensional (a 
vectorized image or volume array for example), while y is relatively low-dimensional. 
It is similar to the linear algebra algorithms used in the reduced update optimizers 
defined in [4lf30]. Unlike these algorithms, which are designed for specific least squares 
optimization tasks, our algorithm can be used in any Newton-type optimizer, including 
ones that handle Poisson likelihoods or bound constraints. 

Solving normal equations by block decomposed QR factorization 

We now present a method for solving normal equations by block Gaussian elimination 
and QR factorization. Normal equations are systems of the form 

J^JAx = -J'^r, (33) 

where J G E™^^. The Newton-type system BAx = —g has this form when we use a 
Gauss-Newton Hessian model or its generalization for non-quadratic likelihoods |46| . 
Assuming B = J, the reduced and damped Gauss-Newton system {Bij+XI)Axi = 
—gi from Alg. [5] can also be written in this form by deleting columns from J and 
augmenting the result with the scaled identity matrix ^/XI [TT] . 

Cholesky factorization is the fastest way to solve normal equations, but rounding 
error can amplify to unacceptable levels when J is highly ill-conditioned, as in some 
curve fitting problems. Greater accuracy can be gained at the expense of additional 
computation by QR factorizing J. Assuming J is full rank, we will write the (thin) 
QR factorization as [Q,R] = qr(J), where Q G i^^x^ jg an orthogonal matrix and 
R e M^^^ is an invertible upper triangular matrix. Substituting J — QR into (p3| 
and noting that Q-^Q — I, we obtain the solution 

Ax ^ -R-^Q'^r. (34) 

In our method we solve (l33l) by QR factorizing not the system itself, but its block 
decomposed form ([7]). We begin by putting (jT)) in normal equation form. From (|18p 
we have Bg = JjJg, where Jg — Pj~ Jy. Similarly we have —gy + By^B^^g^ = —J^r. 
From these results we can write (O as a pair of normal equations: 

{J^Js)Ay = -Jjr (35a) 
(Jj J,)Az = - Jj(r + JyAy). (35b) 

To compute Jg , we need to compute Pj- . This may be done using the QR factorization 
of J^: if X = QR is the QR factorization of a matrix X with full column rank, we 
have 

P^^I-QQ'^. (36) 
Using this result we can form Jg and solve the system as described in Alg. [SI 
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Algorithm 6 Solution of J^JAx — —J^r by block decomposed QR. 



[Q„i?,]=qr(J,) 

[t,T]=QUr,Jy] 

Jg — Jy ' Q 

[Qs,Rs] = qr(Js) 

Ay = -Rj'Qjr 



Alg. [S] is useful when has structure that makes its QR factorization easier to 
compute than that of the full J. As an example, suppose that Jz is a block diagonal 
matrix with blocks ji*^ for i = 1, . . . ,n. Such matrices arise in separable problems 
with multiple measurement vectors. In this case Qz and Rz are block diagonal and 
Alg. [S] can be adapted to exploit this, as shown in Alg. [T] Note that this algorithm 
never generates the large sparse matrix J, but only the nonzero blocks Jy^ and ji*^ , 
which are computed just when they are needed. We expect this resource economy 
to result in reduced memory usage, higher cache efficiency, and ultimately a faster 
solution. 

Algorithm 7 Alg. IH] specialized to the case of block diagonal Jz. 
1: for i — 1, . . . ,n do 

2: Compute Jy^ , ji*^ 
3; [Qi*\i?«] = qr(jP) 

5: Ji') = - Qi'V(') 

6: end for 

7: [Qs,Rs]^qr{Js) 

8: Ay = -~R-^Q^r 

9: for « = 1, . . . , n do 
10: Az(') ^ -[Ri\^{t('^ + r(*)Ay) 

11: end for 



4-2. Mixed CG/Direct method for systems with one very large block. 

In some separable inverse problems, the number of linear variables z is too large for 
direct solution by Cholesky or QR factorization. This is particularly true in image and 
volume reconstruction problems: if each pixel of a 256 x 256 pixel image is considered 
a free variable, which is very modest by imaging system standards, the relevant 
Jacobians and Hessians will be 65536 x 63356 and usually impossible to factorize or 
even store in memory. In this case conjugate gradients (CG) or other iterative linear 
algebra methods must be employed to solve the Newton- type systems BAx = —g. 
These methods only need functions that compute matrix-vector products with B, 
which may be much less memory consuming if B has special structure. Unfortunately 
the matrix B is often ill-conditioned, which can lead to slow convergence of CG. In 
some cases, Bzz is well conditioned, but the additional blocks involving the nonlinear 
variables y result in a poorly conditioned B. A method that uses iterative linear 
algebra only on the subblock Bzz has the potential to be more efiicient. 
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Such a method may be derived by solving BAx = —g in the block decomposed 
form ([7]). We first solve (|7a)) by forming the small matrix Bg = Byy — ByzB~^ B^y 
column-by-column. We solve for Ay by Cholesky factorizing this matrix, then solve 
(|7bp by CG to obtain Az, as summarized in Algorithm^ To understand when Alg. |5] 
may be more efficient than full CG, we roughly estimate and compare the costs of each 
algorithm. Let t be the total floating point operations (flops) required to compute a 
matrix- vector product with B. We split t into t ~ ty + tz, where tz is the cost of 
a matrix- vector product with Bzz, and ty is the cost of computing products with 
all three remaining blocks Byy, Byz, and Bzy Then solving BAx = —g requires 
Teg = k{ty + tz) flops, where k is the number of iterations required to achieve some 
suitable accuracy. 



Algorithm 8 Mixed CG/Direct solution of BAx = —g. 

Input: Functions that compute matrix- vector products with Byy, Byz, Bzy, Bzz- 
Inverse matrix-vector products B~^w are computed by conjugate gradients. 
1: for i — 1, . . . , Ny do 
2: {Bs):,t = Bse^ 
3: end for 

4: Calculate a Cholesky factorization R = Bg 

5: 9r = gy~ ByzB^^Qz 

6: Ay ^ R-^R-'^gr 

7: Az = ~B-,\gz-BzyAy) 



In Alg. [51 we assume that computing Bg is the dominant cost and the other 
computations are negligible, which is reasonable if Ny is significantly greater than 1. 
If kz is the number of CG iterations required to solve BzzU — w to suitable accuracy, 
then the cost of computing each column of Bg is ty + kztz, yielding a total cost of 
Tmix = Ny{ty + kztz) for all Ny columns. By setting Tmix < Teg, we see that Alg. |8] 
will outperform full CG when the iterations k required by full CG exceeds a certain 
threshold: 

k>NyhL±^. (37) 
y ' 

The right-hand side is smallest when ty is much larger than tz, kz, and Ny is relatively 
small; this corresponds to the case where Bzz is relatively well conditioned, products 
with Bzz are cheap, and there are not too many parameters in y. If ty 3> tz, then 
the threshold becomes k > Ny. This is the minimum number of iterations we would 
expect from full CG if the eigenvalues of Byy are isolated, so Alg. |8] should perform at 
least as well as full CG in this limit. However, if the spectrum of Bzz and the other 
blocks combines unfavorably, the required iterations k could be much larger, in which 
case Alg. [8] should be more efficient. 

Even when ((37)) does not hold, Alg. [8] may still be desirable for other reasons. 
For example, if B is much more ill-conditioned than Bzz, round-off error will be less 
severe in Alg. [S] than in full CG because direct linear algebra is less vulnerable to 
bad conditioning. Also, Alg. [5] is highly parallelizable because each column of Bg can 
be computed completely independently of the others, while full CG is an inherently 
sequential algorithm. 



Generalizing Variable Elimination Beyond Least Squares 



19 



5. Numerical experiments 

In this section we show how semi-reduced methods can help us solve practical 
scientific problems faster and more robustly. To this end, we consider two model 
inverse problems relevant to scientific applications. In these problems, the use of 
Poissonian likelihoods and/or bound constraints greatly increases solution accuracy, 
so the unconstrained least squares is not preferable and reduced update methods are 
not appropriate. They are also well-suited for the linear algebra methods derived 
in 21 The first problem is an exponential sum fitting problem involving multiple 
measurement vectors, and the second is a semiblind deconvolution problem from solar 
astronomy. We also solve a third problem, which is a toy model of the second problem. 
Its purpose is to show when trial point adjustment can be useful, since (as discussed 
below) we did not find it particularly useful in the first two problems. 

For each of the three problems, we selected an appropriate semi-reduced method 
and compared it to a standard full update method. In the first two problems, 
the full update method was the projected Newton-type method Alg. [5] with no 
block Gaussian elimination and no block trial point adjustment. This approach was 
compared with two alternatives; Alg. [5] with elimination off and adjustment on, and 
Alg. [S] with elimination on and adjustment off. (Elimination and adjustment act 
independently, so testing a fourth condition with both techniques switched on yields 
little additional information.) Block Gaussian elimination was performed using one 
of the methods derived in 21 while block trial point adjustments were obtained 
by performing a few iterations of Alg. [5] to approximately solve miuz^z Fiu^ , z), 
starting from the current iterate z*^. The parameters were set to 6 = 10""*, 
a = 0.2, A™„ = 10-20^ Xmax = 1020, eo = 2.2 • lO-^^^ pgood = 0.7, pw = 0.01, 
T = max(2.2 • lO"^'^, ||7'(a;° - VF(a;°)) - x°\\ /lO*) where x° is the initial point. In 
the third problem we use simpler algorithms which we describe later. All of our 
experiments were performed in MATLAB R2011a on a MacBook Pro with 2.4 GHz 
Intel Core 2 Duo processor. 

Our first finding was that trial point adjustment did not help us to solve the 
first two problems faster. Adjustment sometimes reduced backtracking and the total 
number of outer iterations needed, but not consistently or dramatically enough to 
outweigh the cost of solving inner adjustment subproblems at every iteration. As a 
result, total function evaluations and total runtime generally increased significantly 
when adjustment was used. For example, over 20 randomly sampled instances of 
the problem in ^ 35. 11 we compared a stringent inner solver {kmax = 100, r = 
||'P(z'^ — zF{y'' , z'^)) ~ z'^W /lO*) to no inner solver at all; in the former case the 
total function evaluations to solve the problem ranged from 200 — 600, while for no 
inner iterations the range was 60 — 100. We tried various intermediates between these 
two extremes — intermediate values of r, lower values of kmax, stopping early if the 
Armijo condition was satisfied before the inner iteration limit — but we always found 
that it was most efficient to simply set kmax = 0, meaning no trial point adjustment. 

For this reason we do not report any further on the effects of trial point adjustment 
in the first two problems. Instead we focus on the effects of block Gaussian elimination 
in the first two problems, and consider adjustment's effects only in the third problem. 
Note that since high-precision inner optimizations generally cause inefficiency in 
the first two problems, extensions of variable elimination that require them (such 
as [13j.39^) would be vulnerable to inefficiency in these problems, even if they could 
handle nonquadratic objectives. 
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5.1. Exponential sum fitting 

In exponential sum fitting problems, the expected value fi(t) of a physical quantity at 
time t is assumed to be the sum of c exponentially decaying components with decay 
rates j/j and nonnegative weights Zj: 

c 

Kt) ^^^Zjexpi-yjt). (38) 
i=i 

In many cases the decay rates do not vary from experiment to experiment, but the 
weights z may vary |47| . Thus, if n experiments are performed, the expected decay in 
the A;*'* experiment is 

c 

l^k{t) ^^Z.ikexp{-yjt), fc = l,...,n. (39) 

We assume that a set of m Poisson-distributed observations Bik, . . . ,Bmk of each 
jJLk{t) are made at i = ti, . . . , trn,: 

z = 1 m 

B,fc - Poisson(^fc(i,)), for ^ {" ' \ (40) 

If the columns of B and Z are stacked on top of each other to form vectors h and z, 
then the associated maximum likelihood problem is 

minimize L{{In® A{y))z) subject to 2: > 0, (41) 

where A{y)ij = exp{—yjti), is the Kronecker product, /„ is the n x n identity 
matrix, and L{fi) is the Poisson negative log-likelihood. 

Using this model we generated synthetic data which simulated the problem of 
determining several decay rates from a large collection of relatively low-count time 
series. Each time series was generated from c ~ A decaying components with rates 
(2/1)2/2 5 2/3,2/4) = (1:2,3,4) and m = 1000 uniformly spaced time samples from t — 
to 5. The number of measurement vectors was n = 100, and the nonnegative weights 
were randomly generated according to Zjk — 10 exp(1.2Zjfc), where the Zjk were 
random numbers from the standard normal distribution. A typical curve generated 
by this model is shown in Fig. [51 While this simple model does not directly represent a 
real physical problem, it generates problems similar in mathematical form, scale, and 
difficulty to problems encountered in real data analysis [101147] . In particular, each 
component has a few measurement vectors in which it dominates, but no component is 
ever observed in complete isolation. The persistent mixture of components with similar 
rates and the low signal-to-noise ratio combine to make this problem formidable. 

As we mentioned above, trial point adjustment was not useful in this problem, 
so here we compare Alg. [5] in two modes: a semi-reduced mode with block Gaussian 
elimination, and a full update mode without it. In both cases the Hessian model was 
computed using the Gauss- Newton method [46 [[i5] . and the resulting Gauss- Newton 
system was solved by QR. (Direct Cholesky factorization of the normal equations 
is not sufficiently accurate due to the notoriously poor conditioning of exponential 
fitting problems [S].) In the standard mode, the full normal equations were solved 
directly using MATLAB's built-in sparse QR routine, while in the block Gaussian 
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elimination mode, we used a MATLAB implementation of Alg. [T] MATLAB sparse 
QR employs the state-of-the-art SuiteSparseQR package [15] ■ To obtain the best 
possible performance from SuiteSparseQR, matrix-vector products with the Q factor 
were performed implicitly, and a permutation was applied to switch the blocks Jy 
and Jz- (The permutation speeds up the algorithm by an order of magnitude, 
as it enables the underlying Householder triangularization method to preserve the 
matrix's sparsity pattern.) Note that Alg. [7] has a less efficient implementation than 
SuiteSparseQR because the loops in Alg. [7] run relatively inefficiently in MATLAB, 
while SuiteSparseQR is written in C-I--I-. 

Our main finding was that block Gaussian elimination computed steps several 
times faster than sparse QR with no loss of accuracy. In a typical random instance of 
the problem described above, step computation by sparse QR factorization of the full 
Jacobian required 0.38 seconds (s), while Alg. [7] solved the system in 0.10 s, a roughly 
4- fold improvement. Since most of the algorithm's time is spent in step computation, 
the minimum was found significantly faster using block Gaussian elimination: in this 
instance, the standard mode took 18 s, while using Alg. [7]took 6 s. The accuracies of 
the two modes were functionally indistinguishable, as the objective values F{y^,z^) 
output in each mode were the same to at least 8 significant figures. From this we 
infer that the two algorithms do essentially the same mathematical operations, but 
the computer finishes the operations faster using Alg. [71 

The speed difference can be explained by two factors. First, Alg. [71 does not build 
the full J matrix, but factorizes of the n diagonal blocks of Jz just as they are needed. 
In contrast, the sparse QR algorithm must build all of J first, which takes 60 — 80% 
of the GPU time required to actually solve the system. In Alg. [Tithe blocks of Jz are 
built and factorized just-in-time, so there is no need to build a large sparse matrix. 
Second, Alg. [71 solves the overall system by solving a large number of small and very 
similar subsystems, which is more CPU and cache-friendly than operating on a large 
sparse matrix. 

The formidable difficulty of this problem, and the need for a bound-constrained 
Poissonian solver, may be appreciated by comparing the accuracy of the Poissonian 
method to a popular alternative for Poissonian problems, the variance-weighted least 
squares method. In the variance-weighted least squares method one solves 

minimize \\W[{In ® A{y))z~ h]\\l subjectto z > 0, (42) 

1 /2 

where is a diagonal matrix with Wa = l/max(6j ,e), and e = 1 is a small 
constant used to avoid division by zero [18]. We generated 100 random instances 
of the exponential sum fitting problem described above, and solved each using the 
Poissonian approach (|41|) and the weighted least squares approach ([42]) , in both cases 
using Alg. [5] The decay vector y resulting from each experiment was sorted to account 
for the problem's permutation ambiguity, resulting in 100 estimates of 2/1,2/2,2/3, and 
2/4 from each method. We then calculated the median and median absolute deviation 
of the 100 estimates of each yi from each method. (We used the median as a summary 
statistic because it is invariant to reparametrization of y and robust to the occasional 
failures of both methods.) The results are shown in Fig. [21 right, and it is clear that 
the Poissonian solver's decay rates are far more accurate. 
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Figure 2. Left: Sample data from the sum-of-exponentials model. The four 
decaying components (blue dotted lines) have decay rates yj = j for j = 1, 2, 3, 4, 
and when summed together with weights zj , these components create the expected 
intensity curve fj,{t) (solid black line). The poisson-distributed samples bi of fi{t) 
(red dots) are taken at a spacing of At = 0.005. The low available counts 
suggest a Poisson likelihood should be used. Right: Comparison of fitted and 
true decay rates yj for j = 1,2,3,4 using variance-weighted nonnegative least 
squares and Poisson likelihood. The bar heights are the median values found 
by solving 100 random problem instances, and the error bars represent median 
absolute deviations. 



5.2. Multiframe semiblind deconvolution 

Image deconvolution is a linear inverse problem in which we have an image b degraded 
by convolution with a known point spread function (PSF) /i, and we wish to undo 
the degradation to obtain the unknown clean image z. Assuming that each of these 
variables are 2D arrays supported on a square G , we can write the problem as 

Az + e = b, (43) 

where A : M^^ is the convolution operator: Az — h * z, and we assume 

periodic boundary conditions for simplicity. In multiframe blind deconvolution, there 
are several images and PSFs and the PSFs depend on unknown parameters, so that 
we have 

+e(''^' = fc"'') for fc = (44) 

If we have a parametric model of the PSFs, the problem is called semiblind. 

Here we consider a simplified, synthetic version of a real multiframe semiblind 
deconvolution problem from solar imaging, which is described in |50j . In this problem, 
a spaceborne telescope observing the Sun in the extreme ultraviolet wavelengths 
collects images which are are contaminated by stray light. The stray light effect 
is well-modeled by convolution with a single unknown parametric PSF. The telescope 
observes n images of the Moon transiting in front of the Sun, and while the Moon 
does not emit in the extreme ultraviolet (Fig. |3l top middle)^ stray light from the Sun 
spills into the lunar disk {bottom middle). Given the supports M^'^^ C of the lunar 
disks within each image, our task is to determine the PSF by solving 



mmimize 



|U(j/)z('=) - b^'''> ^ subject to fk) for fc = 1, . . . , n. (45) 

fc=l " ^MW - 



Generalizing Variable Elimination Beyond Least Squares 23 




Figure 3. Overview of tlie solar semiblind deconvolution experiment. Top left: 
The ground truth PSF profile p^^^^ir) in log-log scale, where it is piecewise linear. 
Bottom left: The ground truth PSF generated by the profile above. Top middle: 
one of the three clean lunar transit images, with lunar disk in the bottom left 
corner. Bottom middle: the observed image formed by convolving the top image 
with the PSF. Right: semilog plot of objective versus iteration (top) and CPU 
time (bottom) for the standard mode of Alg.[5]and the mode employing the mixed 
CG/Direct method. 

The PSF is modeled using two components. The PSF core is modeled by a single pixel 
with unknown value a £ (1/2, 1], while the wings are modeled by a radially symmetric 
piecewise power law ppi^r) depending on unknown parameters f3: 

hy{v) ^ aSo{v) + {1- a)p,3{\\v\\2), fori; e fi, (46) 

where do is the Kronecker delta. To define the piecewise power law, we set pp{0) = 0, 
then for r > we set logarithmically spaced breakpoints {ri)f^Q defining S" = 12 
subintervals, starting from tq = 1 and ending at rs = where s is the sidelength 
of the square 51. On each subinterval [ri-i, ri), the formula is given by pp{r) oc r~^*, 
where /3i > 0, and /3 = (/3i, . . . , /Jg). The proportionality constants are determined 
by a continuity constraint between subintervals and the normalization constraint 
Xlw^'iSdl^ll) = 1- The free parameters of the PSF model hy are then y = (a,/3) € M^f, 
where Ny = 1 + 5 = 13. The true profile ppir) was generated using /3 values similar 
to those in [SO], and is shown in log- log scale in (Fig. |31 top left), with the resulting 
PSF directly below. 

We used data from the STEREO-EUVI satellite to generate n = 3 synthetic lunar 
transit images of size 256 x 256. To simulate the Moon's transit, a disk of pixels was 
set to zero in each image. These images were convolved with the ground truth PSF 
to create the blurry observations, and no noise was added for simplicity. (Noise is not 
a very important issue in this problem because the real data have very little noise at 
this resolution, deconvolution with the PSF is quite well-conditioned for a > 1/2, and 
the expected range of a is well above this.) 

As before, Alg.[5]was run in two modes: a semi- reduced mode employing Gaussian 
elimination, and a standard one without. In the standard mode of Alg. [SJ the 
search direction was calculated by CG on the full system BAx = —g. Preliminary 
experiments revealed that the full CG algorithm was very slow. However the situation 
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improved substantially when we used a scalar preconditioner cl on the y block, where 
c = 10^ was found to work well. All CG iterations were stopped at a relative residual 
tolerance of 10~^ and maximum iteration ceiling 40, as these values worked relatively 
well for the full CG algorithm. In the block Gaussian elimination mode, the search 
direction was calculated using the mixed CG/Direct algorithm, Alg. [51 The mixed 
CG/Direct algorithm required no special tuning or preconditioners. 

Our main finding was that the block Gaussian elimination mode using Alg. |8] 
converged quite quickly and robustly, while the standard mode experienced a long 
period of sluggish convergence after an initially fast descent (Fig. [31 right). The 
average CPU time per step was about the same in each of the two modes, so we can 
attribute the block Gaussian elimination mode's superior performance to better search 
directions, which enabled convergence in far fewer iterations than the full CG mode. 

The better search directions of the block Gaussian elimination mode can be 
explained by considering the unusual spectrum of the Gauss-Newton Hessian. It has 
two very different components: a large cluster of « near-unity eigenvalues due to 
the very well-conditioned B^z block, and a sprinkling of « Ny eigenvalues contributed 
by the other three blocks. The latter are, to put it lightly, less tame: they can easily 
spread over 15 orders of magnitude and move unpredictably as the iterations progress. 

Naively, we might expect full CG to make short work of such a system. We simply 
apply a scalar preconditioner to the badly behaved blocks involving Ay, pushing the 
Ny scattered eigenvalues to lie above the cluster. Then the spectral theory of 
CG predicts convergence in Ny + iterations, where is the number of iterations 
required to make the CG spectral polynomial nearly zero on the N^ cluster [5T]. We 
expect kz to be small because the Nz cluster is very tightly centered around unity. 

In practice, however, it is difficult to know in advance where the mobile eigenvalues 
will be, and their enormous spread raises issues of rounding error. Thus it is difficult 
to get good solutions out of full CG, and the search directions suffer, causing sluggish 
convergence. In constrast, the mixed CG/Direct algorithm applies CG to the well- 
conditioned Bzz block alone, and deals with the other blocks by direct linear algebra. 
Since direct linear algebra is much less susceptible than CG to ill-conditioning and 
rounding error, the result is high-quality search directions and quick convergence. 

5.3. A model semihlind deconvolution problem for block trial point adjustment 

Given the failure of trial point adjustment to speed up the solution of the previous two 
problems, the reader may wonder if it has any application beyond its theoretical role in 
the connection between full and reduced update methods. The literature suggests that 
adjustment certainly can increase convergence rate and robustness [26ll28l[29[l311l52) . 
However the speed gains relative to standard methods are highly variable: adjusted 
methods are slower in our experiments, a factor of 2 or 3 times faster in certain 
image processing problems, and multiple orders of magnitude faster in some difficult 
curve fitting problems. Clearly adjustments must be adapted to the problem at hand, 
but it is difficult to predict when it will be useful. Here we present a toy semiblind 
deconvolution problem similar to the one solved in the previous section, and show that 
trial point adjustment is valuable for solving this problem in the most difficult cases. 

As in the solar problem, our toy problem involves semiblind deconvolution of an 
extended, uniformly bright object which has been convolved with a long-range kernel. 
The true image u* and kernel /i* are both 1-D signals of length m supported on 
{—j, . . . , j}, where m = 2j/ + 1. They come from single-parameter signal families given 
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by hy{p) — y5o{i) + (1 — y)^l{i) and u^ii) — z ■lg{i), where ls(i) is the indicator for 
the set 5* = {—£, ■ ■ ■ ,i} of size s — 2£ + 1, l{i) is the constant ones function. Letting 
(y*, z*) denote the unknown true parameter values, the problem is to determine (y*, z*) 
from the blurry observation / = /i* * u*, where periodic convolution and no noise is 
assumed. The values of y* and 2* can be found by minimizing the difference between 
hy * Uz and / with respect to some loss function, which we choose as the Huber loss 

with threshold t = 0.3. We choose this loss function simply because it is a common 
nonquadratic loss and the optimization phenomenon of interest occurs when it is used. 
Noting that physically we must have < y < 1 and z > 0, we obtain the optimization 
problem 

min ^ ^ (F(y, z) ^ ^ ^((/i, * - /),)). (48) 



0<y<l,z>0 



A simple formula for F{y, z) can be found by observing that both the prediction hy*Uz 
and / take only two values. Letting p = s/m be the ratio of the object support to 
signal size, (71 (y, z) = yz + (1 — yz)p the predicted value of the blurry image on the 
support, q2{y,z) = (1 — y)zp the predicted value off the support, and q* = qi{y^,z*') 
for j = 1, 2, we have 

F(y, z) = - {pl{qi{y, z) - q{) + (1 - p)i{q2{y, z) - q\)) . (49) 

The to/2 scale factor does not affect the location of the minimum nor the path of 
any of the optimization algorithms we consider here. Therefore, for our purposes, the 
parameter p is effectively the only free parameter in the problem family. We use the 
values p = 10~^, 10~® to create two objectives whose graphs are depicted in Fig.|4l far 
left. As p — 0, the term p£{qi{y, z) — qf) vanishes, the (1 — p)£{q2{y, z) — 93) becomes 
dominant, and the objective landscape becomes a narrow, hyperbolic trench. 

We solved this problem at both values of p, and for each value we used Alg. 4 in 
a full update mode (without trial point adjustment) and a semi-reduced mode (with 
trial point adjustment). In the latter case, the block trial point adjustment used was a 
single iteration of Alg. 4 to minimize i^(y, z) in z with y fixed. Algorithm parameters 
were chosen as in the previous section. 

The paths taken by the full and semi-reduced methods are shown in Fig. 21 center 
left and right. We observe that the methods take nearly identical paths when p = 10~^, 
but when p — 10^^ the full update method is forced to take very small steps. At far 
right, the distance to the minimum, [[(y'^, z'^) — (y*, z*)!!^, is plotted versus iteration k 
for each method. The superior convergence rate of the semi-reduced method is clear 
when p = 10~^. 

The behavior of each algorithm can be understood by considering the geometry 
of the steps it takes. The full update method takes steps along straight lines. Straight 
lines cannot follow a curved trench for long, so there is an upper bound on the size of 
an admissible step. As p — )■ 0, the trench tightens and the admissible steps become 
very small, so that progress is very slow. The semi-reduced method takes a 'dogleg' 
step as illustrated in Fig. [U which enables it to stay in the valley. 

To avoid the small admissible step issue that stymies the full method, it is critical 
that adjustment be done before the trial point is evaluated. This is the key feature 
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objective full semi-reduced error 




Figure 4. Comparison of alternating, joint, and nested descent strategies on toy 
blind deconvolution problem. Row 1: Plots of F{y,z) for p = 10~^ (top) and 
10"'' (bottom), logarithmic greyscale. The white crosses mark the minimum at 
{y^,z^) = (0.7, 1), where F = 0. Center left and right: The iterates of the full 
and semi-reduced methods for each p value, starting from {y'' , z^) = (0.02,0.02). 
Far right: semilogarithmic plot of the error ||(y'°,2*) — (j;*,2:*)||2 versus iteration 
k for the full method (dashed line) and scmi-rcduced method (solid line). 

distinguishing semi-reduced methods from other methods, such as simple alternation 
between a full update and a partial update. Other strategies, such as nonmonotone line 
search [53j and greedy two-step methods [37] , have a similar step structure and could 
also work on this problem; however it is unclear if they can match the semi-reduced 
method's complete insensitivity to the value of p. 

It is important to note that the phenomenon we have described here does not occur 
for all loss functions £{x). For example, we found that if the Poisson log-likelihood 
is used, the objective landscape does not have such a tight curved valley, the full 
update method solves the problem quite efficiently, and the semi-reduced method's 
inner iterations expend effort without benefit. Curved valleys are thus an occasional 
problem with potentially severe efficiency consequences. The semi-reduced framework 
seems appropriate for dealing with such a problem, since one has the option to perform 
inner descent iterations only when necessary. 

6. Conclusion 

Reduced update optimization methods, which are based on variable elimination, 
have been found to be particularly fast and robust in certain difficult separable 
inverse problems. Unfortunately, using them in problems beyond unconstrained 
least squares presents serious theoretical and practical difficulties, in particular the 
need for expensive optimal trial point adjustments and complex derivatives of a 
reduced functional. We have described a new class of semi-reduced methods which 
interpolate between full and reduced methods. Semi-reduced methods share the 
desirable characteristics of reduced methods while being flexible enough to avoid their 
downsides. A key advantage of the semi-reduced framework is the flexibility to use 
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adjustments where they are useful and avoid them where they are not, all within a 
single convergent method. 

We began by reinterpreting reduced methods as full update methods that have 
been modified and simplified. We showed that if block Gaussian elimination and 
an optimal block trial point adjustment are used within a full update method, the 
adjustment's optimality renders certain computations unnecessary. Removing these 
unnecessary computations yields a simplified method that turns out to be equivalent to 
a reduced method. To confirm that this reinterpretation of reduced update methods is 
correct and generally applicable, we derived the well-known reduced update Newton 
and variable projection methods using our modification and simplification process. 
We defined semi-reduced methods by omitting the final simplifications, which frees us 
from the need to perform expensive optimal block trial point adjustments. We then 
incorporated block Gaussian elimination and trial point adjustment into an algorithm 
for general bound constrained problems, and showed that its convergence follows 
almost immediately from the convergence theorem for the original method. Finally, 
we showed that many of the structure-exploiting properties of variable elimination can 
be obtained by using appropriate block Gaussian elimination algorithms. 

Block Gaussian elimination is suited for problems where the Hessian model's Bzz 
subblock is block diagonal, circulant, banded, or has other exploitable structure. We 
described two situations where we expected block Gaussian elimination to outperform 
a standard all-at-once method, and these expectations were borne out in numerical 
experiments on realistic problems derived from the scientific inverse problem literature. 
It is notable that both of the methods we presented involve the solution of many 
independent subproblems and are thus ideal candidates for parallelization. 

Block trial point adjustment is appropriate when we expect the graph of F{y, z) 
to contain a narrow, curved valley. Trial points from full update methods tend to leave 
the valley and thus will be rejected unless a trial point adjustment is used to return 
to the valley. In our first two numerical experiments trial point adjustments turned 
out to be computationally wasteful, so it was critical that we had the flexibility to 
perform suboptimal adjustments or even none at all (which turned out to be the best 
option) . In our third experiment we presented a reasonable toy inverse problem where 
the curved valley effect was significant enough to warrant trial point adjustment, but 
the parameter values where this occurred were somewhat extreme. Since the curved 
valley effect is important in some real problems [37l|52], a better understanding of 
precisely when it occurs would be useful. 
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